
PreTrendsPower: Roth (2022) paper review + silent-failure guard#463

Merged
igerber merged 22 commits into main from feature/roth-2022-paper-review
May 17, 2026

Conversation

@igerber
Owner

@igerber igerber commented May 17, 2026

Summary

  • New Roth (2022) paper review at docs/methodology/papers/roth-2022-review.md (298 lines) — methodology audit for the PreTrendsPower estimator, with registry-candidate block holistically split into ### Paper-derived requirements + ### Library design choices subsections and explicit boundary markers.
  • REGISTRY.md ## PreTrendsPower: 3 new **Note:** blocks documenting deviations surfaced by the audit — diag-Σ_22 fallback on CS/SA adapters, count-based linear weights (not Roth's δ_t = γ·t for irregular grids), and the new power_at(custom) silent-failure guard.
  • diff_diff/pretrends.py: PreTrendsPowerResults.power_at() now raises NotImplementedError for violation_type="custom" instead of silently substituting equal weights; unknown-violation_type else branch hardened to raise ValueError (mirrors _get_violation_weights).
  • METHODOLOGY_REVIEW.md: PreTrendsPower row records paper review on file (2026-05-17); promotion checklist updated.
  • TODO.md: 5 new rows under Methodology/Correctness tracking deferred follow-ups (Σ_22 fidelity, R-package revision pin, helper custom support, weight persistence, linear-units alignment).
  • New regression tests: test_power_at_raises_on_custom_violation_type + test_compute_pretrends_power_rejects_custom_violation_type + test_compute_mdv_rejects_custom_violation_type (93/93 passing across test_pretrends.py + test_pretrends_event_study.py).

Methodology references (required if estimator / math changes)

  • Method name(s): PreTrendsPower (paper-derived methodology audit); PreTrendsPowerResults.power_at() (silent-failure guard).
  • Paper / source link(s): Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. AER:I 4(3), 305-322. https://doi.org/10.1257/aeri.20210236
  • Any intentional deviations from the source (and why) — all documented inline in REGISTRY.md ## PreTrendsPower:
    1. CS/SA covariance source: shipped adapter uses diag(ses^2) rather than full pre-period Σ_22 for staggered result types (CS deliberate where event_study_vcov is available; SA forced because SunAbrahamResults does not expose an event-study VCV surface). Not provably conservative; tracked in TODO.md for the PR-B audit.
    2. linear violation pattern: _get_violation_weights("linear") constructs [n_pre-1, ..., 1, 0] from n_pre alone — does not honor actual relative-time labels, so reported MDV is not in Roth's γ units for irregular / anticipation-shifted grids. Tracked in TODO.md.
    3. Helper/class API split + power_at(custom) silent-failure guard: compute_pretrends_power / compute_mdv reject violation_type="custom" (no violation_weights parameter); the class-direct path works at fit time but PreTrendsPowerResults.power_at() now raises NotImplementedError for custom fits (was a silent equal-weights fallback at pretrends.py:230-231). Tracked in TODO.md for weight-persistence follow-up.
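
The fail-loud guard described in item 3 can be sketched as follows. This is a simplified, hypothetical rendering of the behavior the PR describes (raise `NotImplementedError` for `"custom"`, `ValueError` for unknown values), not the shipped `diff_diff/pretrends.py` implementation; the placeholder power computation is illustrative only.

```python
# Hypothetical, simplified sketch of the power_at() silent-failure guard.
# Names follow the PR text; the body is illustrative, not the shipped code.
from dataclasses import dataclass


@dataclass
class PreTrendsPowerResults:
    violation_type: str  # "linear", "constant", "last_period", or "custom"

    def power_at(self, M: float) -> float:
        if self.violation_type == "custom":
            # Fitted violation_weights are not persisted on the result object,
            # so the old equal-weights fallback silently answered the wrong
            # question. Fail loudly and point users to refitting instead.
            raise NotImplementedError(
                "power_at() cannot reconstruct custom violation weights; "
                "refit with the new M instead."
            )
        if self.violation_type in ("linear", "constant", "last_period"):
            return self._power_for_named_pattern(M)
        # Defensive branch: unknown violation_type values also fail loudly,
        # mirroring _get_violation_weights.
        raise ValueError(f"unknown violation_type: {self.violation_type!r}")

    def _power_for_named_pattern(self, M: float) -> float:
        return 0.0  # placeholder for the real power computation
```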

Validation

  • Tests added/updated: tests/test_pretrends.py — 3 new regression tests as listed above. Full test_pretrends.py + test_pretrends_event_study.py: 93/93 passing.
  • Backtest / simulation / notebook evidence: N/A — paper-review-scoped PR; no tutorial or notebook changes.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

igerber and others added 21 commits May 17, 2026 07:43
PR-A of the 3-PR PreTrendsPower methodology review sequence (paper review
-> implementation audit -> R parity goldens). Produces the structured
methodology contract that will inform the PR-B audit of diff_diff/pretrends.py
against the paper's exact equations.

The review covers:
- Propositions 1-4 (bias decomposition, sign-of-bias under monotone trend,
  variance decomposition, convexity gives variance reduction)
- Power calculation algorithm for the slope gamma_p at target power p
- Bias and coverage calculation framework (analytical via tmvtnorm-equivalent
  truncated MVN moments; simulation fallback)
- Implementation notes and 11-item gaps list flagging items the PR-B audit
  will need to resolve (notably: NIS vs Wald acceptance region, violation-type
  extensions beyond linear, R-package version pin for parity)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
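
The power-calculation step covered by the review can be illustrated with a small sketch. This assumes independent pre-period estimates purely for brevity (Roth's Σ₂₂ is generally non-diagonal, and the paper's computation is a joint multivariate box probability); the function name and signature are invented for this example.

```python
# Illustrative sketch of NIS pretest power against a linear trend
# delta_t = gamma * t, under the SIMPLIFYING assumption of independent
# pre-period estimates (the paper's version uses the full Sigma_22).
import numpy as np
from scipy.stats import norm


def nis_power_linear(gamma, rel_times, ses, alpha=0.05):
    """Power of the no-individual-significance pretest vs delta_t = gamma * t."""
    z = norm.ppf(1 - alpha / 2)  # 1.96 at the paper's running alpha = 0.05
    means = gamma * np.asarray(rel_times, dtype=float)
    ses = np.asarray(ses, dtype=float)
    # P(each pre-period coefficient is individually insignificant),
    # i.e. |beta_hat_t / se_t| < z, with beta_hat_t ~ N(gamma * t, se_t^2)
    p_pass_each = norm.cdf(z - means / ses) - norm.cdf(-z - means / ses)
    return 1.0 - np.prod(p_pass_each)


# gamma = 0 recovers the size of the pretest: 1 - 0.95**K under independence
print(nis_power_linear(0.0, rel_times=[-3, -2, -1], ses=[0.1, 0.1, 0.1]))
```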
- P1 covariance block convention: add explicit "post-first" block decomposition
  paragraph defining Sigma_11=Var[beta_hat_post], Sigma_22=Var[beta_hat_pre],
  Sigma_12=Cov(beta_hat_post, beta_hat_pre). Roth's stacked vector is (pre, post)'
  but his proofs partition Var[(beta_hat_post, beta_hat_pre)'] post-first; without
  the explicit convention, Sigma_22 references in the propositions were ambiguous.
- P1 plug-in variance dimensional bug: change sigma^2_{tau_hat} = l' Sigma l to
  l' Sigma_11 l (since l in R^M and Sigma_11 = Var[beta_hat_post] is the only
  block of correct dimensions). Add explicit reminder pointing back to the
  convention.
- P1 missing Note/Deviation labels in copy-ready Registry section: inline
  **Note (deviation from paper):** for slope-of-best-fit acceptance region,
  constant/last_period/custom violation types, and the slope pretest_form
  enum value. Mark joint Wald + custom B(Sigma) as paper-supported framework
  (Propositions 1, 3, 4 apply) but not separately tabulated. Mark "individual"
  (NIS) as the only paper-analyzed form.
- P3 brittle line reference: replace "line 2758" with the heading reference
  "## PreTrendsPower stub".
- Algorithm step 2 cleanup: clarify sigma_t = sqrt(Sigma_22[t, t]) under the
  new convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
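
The post-first block convention and the dimensional fix above can be made concrete with a short sketch. Dimensions and the aggregation vector `l` are schematic assumptions for illustration, not values from the library.

```python
# Sketch of the post-first block decomposition fixed above: partition
# Var[(beta_hat_post, beta_hat_pre)'] so that Sigma_11 = Var[beta_hat_post],
# Sigma_22 = Var[beta_hat_pre], Sigma_12 = Cov(post, pre). The plug-in
# variance of tau_hat = l' beta_hat_post is l' Sigma_11 l — the only block
# whose dimensions match l in R^M. Schematic dimensions only.
import numpy as np

M, K = 2, 3  # M post-period coefficients, K pre-period coefficients
rng = np.random.default_rng(0)
A = rng.standard_normal((M + K, M + K))
Sigma = A @ A.T  # a valid (M+K) x (M+K) covariance, post-first ordering

Sigma_11 = Sigma[:M, :M]   # Var[beta_hat_post], M x M
Sigma_22 = Sigma[M:, M:]   # Var[beta_hat_pre], K x K
Sigma_12 = Sigma[:M, M:]   # Cov(beta_hat_post, beta_hat_pre), M x K

l = np.ones(M) / M              # example aggregation weights, l in R^M
var_tau_hat = l @ Sigma_11 @ l  # l' Sigma_11 l: dimensionally correct form
print(Sigma_11.shape, Sigma_22.shape, var_tau_hat > 0)
```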
- P2 API surface drift: add explicit Status disclaimer at top of "Methodology
  Registry Entry" section flagging the block as proposed-replacement-text-not-
  yet-merged (mirrors the established goodman-bacon-2021-review.md:14 pattern).
  Restructure the Tuning Parameters table to add a Status column distinguishing
  "Current" (matches diff_diff/pretrends.py:442-447) from "Proposed" (requires
  API extension via the PR-B audit), so the block is not mistaken for the
  shipped surface.
- P2 Assumption 1 design-heuristic overclaim: drop the "balanced panel +
  cluster-robust + equicorrelated" design-metadata heuristic. Roth's
  Assumption 1 is a condition on the estimated Sigma, not on the experimental
  design. Replace with the requirement that any sharper sign-of-bias warning
  must come from a direct numerical check of Sigma's structure.
- P3 dead internal reference: remove the parenthetical pointer to
  `feedback_verdict_powered_by_tools.md` (a Claude private-memory artifact,
  not a git-tracked file in this repo).
- P3 source-of-truth split: addressed by the Status disclaimer above, which
  explicitly names the existing PreTrendsPower REGISTRY stub as the sole
  authoritative methodology contract until PR-B lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 Proposition 4 convexity overclaim: R1/R2 fixes added language to the
Alternative-acceptance-regions block, requirements checklist, pretest_form
row, and acceptance_region row that said "Propositions 1, 3, 4 apply to any
B(Sigma)". That overstates Roth: Proposition 1 (conditional mean) and
Proposition 3 (conditional variance) hold for any measurable B, but
Proposition 4 (variance reduction / over-coverage under parallel trends)
begins "Suppose that B(Sigma) is a convex set." Nonconvex custom pretests
lose the variance-reduction guarantee even though the conditional-moment
formulas still hold.

Fixed:
- "Custom user-supplied B(Sigma)" bullet now distinguishes Props 1+3 (any B)
  from Prop 4 (requires convex B) and explicitly notes nonconvex pretests
  lose the over-coverage guarantee.
- Requirements checklist now reads "joint Wald (convex, Props 1+3+4 all
  apply); custom B (Props 1+3 apply to any measurable B, Prop 4 only if B is
  convex)".
- `pretest_form` row in tuning parameters: same convexity split.
- `acceptance_region` row: "Propositions 1, 3 apply to any measurable B;
  Proposition 4 / variance-reduction guarantee additionally requires B to
  be convex".

Joint Wald (line 47) already correctly conditioned on convexity ("apply to
this B since it is convex") and was left as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 nonlinear-trend "deviation" overstatement: walk back the "**Note
  (deviation from paper):**" labels on nonlinear violation hypotheses
  (constant level shift, last-period jump, custom delta vector). Roth's
  Section III ("Practical Recommendations") explicitly endorses power
  analyses against hypothesized nonlinear trends through the `pretrends`
  package, so the GENERAL nonlinear-power capability is paper-supported;
  only the SPECIFIC named API parameterizations are R-package conventions.
  Updates: Requirements checklist line and Gaps "Nonlinear violations" +
  "Custom delta" bullets reframed to distinguish paper-supported framework
  (Section III + Propositions 1+3 applying to any trend) from package-API
  parameterizations.
- P1 absolute "UNDER-cover" claim: Roth's wording after Prop 4 is "CIs will
  tend to under-cover if the bias is sufficiently large" — the under-coverage
  direction is contingent on bias magnitude, not universal. Updated the
  Implication paragraph after Proposition 4 to keep the conditionality
  ("tend to UNDER-cover **if the bias is sufficiently large** ... not
  universal").
- P2 "stub" mischaracterization: the live `## PreTrendsPower` entry at
  REGISTRY.md:2758-2808 is a populated 50-line block (Wald-test framed),
  not a stub. Updated the Status header to describe it accurately as "a
  populated 50-line block framed primarily around a joint-Wald pre-trends
  test" and clarify that this file is a non-authoritative source audit.

Slope-of-best-fit "deviation from paper" labels (lines 48, 159, 195) kept
as-is: Roth does NOT analyze a slope-t-stat as an acceptance region in
Section I or Section III; the slope t-stat in Table 1 is only an observed
property of surveyed papers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 alpha/1.96 internal inconsistency: the proposed registry block exposed
  `alpha` as a parameter but hard-coded 1.96 in B_NIS, the plug-in CI, the
  power formula, the K=1 closed-form, and the coverage formulas. Promoted
  to REGISTRY this would document the wrong inference rule for any
  alpha != 0.05. Generalize all five formulas to z_{1-alpha/2}; keep "=1.96"
  parenthetical annotations to ground them at the paper's running default.
- P2 sample vs population identity: "beta_hat_pre, which equals delta_pre
  under no anticipation" muddied Roth's stochastic model. Roth Equations
  (1)-(2) keep beta_hat (random draw) and beta (mean) distinct; under no
  anticipation, beta_pre = delta_pre (population), not beta_hat_pre =
  delta_pre. Reworded to "beta_hat_pre, whose mean beta_pre equals
  delta_pre under no anticipation" with explicit "random draw is not itself
  equal to delta_pre" caveat.
- P2 "11 of 12" overcount: Section I.B of Roth (2022) does not support
  "11 of 12 use individual significance" as the count I wrote. The paper
  reports: 12 of 12 plot pointwise CIs (visual-NIS inspection possible),
  5 of 12 explicitly discuss individual significance, 1 of 12 reports a
  joint-significance test, several rely on visual inspection without a
  formal criterion. Restated using the paper's own breakdown and framed
  NIS as the implicit common denominator rather than a literal explicit-
  rule majority.
- P3 paper methodology vs library extension labeling: the Figure-1-style
  plotting and HonestDiD-composition checklist items inside the copy-ready
  block conflated paper-derived methodology with diff-diff design choices.
  Added "**Note (library extension):**" labels to both, distinguishing the
  paper-derived numerical content (bias and CI by gamma_p; methodology-
  agnostic on how beta_hat/Sigma_hat is produced) from the library
  presentation and cross-estimator composition decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
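
The alpha generalization above amounts to one standard substitution, sketched here; `critical_value` is an invented helper name for illustration.

```python
# The generalization described above: replace the hard-coded 1.96 with
# z_{1-alpha/2}, which recovers 1.96 at the paper's running alpha = 0.05.
from scipy.stats import norm


def critical_value(alpha: float) -> float:
    """Two-sided normal critical value z_{1-alpha/2}."""
    return norm.ppf(1 - alpha / 2)


print(round(critical_value(0.05), 2))  # 1.96
print(round(critical_value(0.10), 3))  # 1.645
```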
Per user direction (option A: holistic restructure) after three consecutive
rounds of "paper-derived vs library-extension boundary" P1s (R3 Prop 4
convexity, R4 nonlinear-trend "deviation", R6 analytical/simulation as
paper-required). The recurring class warranted naming the invariant and
fixing it structurally rather than another surface patch.

Holistic fix:
- Renamed the single Requirements-checklist block into two named
  subsections inside the copy-ready Registry entry:
    ### Paper-derived requirements (must satisfy to be faithful to Roth 2022)
    ### Library design choices (extensions beyond Roth 2022)
- Moved each non-paper item (simulation fallback, method/n_sim knobs,
  pretest_form/acceptance_region surface, named non-linear violation
  parameterizations, Figure-1 plotting interface, HonestDiD composition)
  into the Library Design Choices subsection with explicit framing that
  the library may keep, drop, or extend each item via the follow-up audit.
- Reworded the Tuning Parameters table's Status column to a clearer
  Source column ("Paper-derived" vs "Library extension") so every row's
  origin is unambiguous.

P2 (Table 2 factual error):
- Coverage bullet under "Empirical Findings" said unconditional gamma_{0.8}
  range was "53% to 98%". Actual Table 2 range for tau_bar unconditional
  under gamma_{0.8} is 14% (Deschenes et al. 2017) to 98% (Lafortune et al.;
  Markevich & Zhuravskaya). Fixed.

P3 (transient process / line refs):
- Dropped "PR-B in the 3-PR sequence" language from the Status banner and
  Tuning Parameters note.
- Replaced literal "diff_diff/pretrends.py:442-447" with stable symbol
  refs (`compute_pretrends_power`, `PreTrendsPowerResults`).
- Replaced "lines 2758-2808" REGISTRY range with a non-line-numbered
  description ("populated block framed primarily around a joint-Wald
  pre-trends test").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 joint-Wald + tmvtnorm still in paper-derived checklist: the R6
  restructure narrowed the recurring "paper vs library" class but two
  items remained mislabeled — joint Wald (a paper-mentioned alternative
  used by 1 of 12 surveyed papers, not the empirical exercise) and the
  tmvtnorm-specific analytical backend (one of several valid numerical
  paths; the paper requires producing the right quantities, not using
  this specific solver). Moved both to "Library design choices
  (paper-supported alternatives and extensions)" with explicit framing
  that the paper requirement is at the QUANTITY level (conditional
  moments, power, bias, coverage), not at the API or numerical-backend
  level. Updated the subsection heading to "paper-supported alternatives
  and extensions beyond Roth 2022" to make the two categories explicit.
- P2 Proposition 4 summary row dropped "if bias is sufficiently large":
  the Key Theorems table row claimed CIs "UNDER-cover under violations"
  unconditionally, contradicting the conditional version stated correctly
  elsewhere in the file. Rewrote the row to limit the variance-reduction
  guarantee to its strict statement and pushed the CI coverage discussion
  into a clearly-conditioned addendum.
- P2 API misattribution: I had said `compute_pretrends_power` exposes
  `alpha, power, violation_type, violation_weights` (which is actually
  the `PreTrendsPower` class signature). The helper's actual signature
  is `compute_pretrends_power(results, M, alpha, target_power,
  violation_type, pre_periods)`. Updated the Status banner and Tuning
  Parameters note to name both surfaces explicitly with their actual
  signatures, and clarified that the follow-up audit will reconcile
  them with each other and with the proposed contract.
- Related P2-adjacent overclaim: line 217 said the library "implements
  a Roth-2022 framework"; line 270 said it "implements computations
  under arbitrary Sigma via Proposition 1". Both overstate what
  `pretrends.py` currently does. Tightened to "Wald-test-based pre-trends
  MDV/power workflow ... computes Wald power/MDV from the pre-period VCV
  block rather than the full arbitrary-Sigma Proposition 1/3/4
  conditional-moment computations", and reframed the heteroskedastic-
  Sigma gap as a future capability conditional on the Prop 1/3/4
  path being added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 verdict was ✅ Looks good (no P0/P1) — addressing remaining polish items.

- P2 internal contradiction on simulation-fallback requirement: the
  "Computational shortcut (footnote 8)" paragraph said the library "should
  support both an analytical truncated-multivariate-normal path AND a
  simulation fallback", contradicting the Library design choices subsection
  which correctly classifies simulation fallback as a library robustness
  choice. Rewrote the paragraph to state the paper-derived requirement at
  the quantity level (compute the correct conditional moments and
  probabilities) and explicitly defer backend choice to the Library design
  choices subsection.
- P2 joint-Wald R-parity overclaim: the Gaps section said "library should
  implement both but test against R `pretrends` for the joint-Wald case
  (Roth's package supports it)." Verified via R-package source/docs that
  the published `pretrends` package exposes only NIS-based surfaces
  (`pretrends()`, `slope_for_power()`, `*_NIS` helpers) — no joint-Wald
  parity target. Rewrote to make explicit that joint Wald is theoretically
  admissible under the propositions but parity would need an independent
  fixture/derivation, not direct R-package parity.
- P3 composition overclaim: the Relation section listed
  `TwoWayFixedEffects` and "any estimator producing an event-study
  coefficient vector" as currently composable, but the shipped
  `compute_pretrends_power` adapter only dispatches `MultiPeriodDiDResults`,
  `CallawaySantAnnaResults`, `SunAbrahamResults` and raises `TypeError`
  otherwise. Split into "Currently composes with" (the three shipped
  adapter targets) and "theoretical compatibility" (extends to TWFE etc.,
  pending follow-up audit adapters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 CS/SA covariance-source deviation: the R7 edit said current
  `pretrends.py` "computes Wald power/MDV from the pre-period variance-
  covariance block", but only the `MultiPeriodDiDResults` branch extracts
  the full pre-period sub-VCV (pretrends.py:592-601). The CS branch
  (pretrends.py:609-652) and SA branch (pretrends.py:660-687) hard-code
  `vcov = np.diag(ses^2)`, dropping the pre-period correlations the
  paper's Propositions rely on. Added an explicit "Note (deviation in
  current covariance-source)" bullet in the "Relation to Existing
  diff-diff Estimators" section documenting the per-result-type split and
  cross-referencing the conservative-variance framing in REPORTING.md.
  The follow-up audit will decide whether to extend full-sub-VCV
  extraction to CS/SA or keep the diag fallback as a deliberate variance-
  conservatism choice.
- P1 Assumption 1 model-vs-estimated Sigma: the R2 edit said
  "Assumption 1 is a condition on the *estimated covariance matrix*
  Sigma". Roth's Section II.B states Assumption 1 on the *model*
  covariance Sigma (the population VCV in Equation 1). Software can only
  inspect the *estimated* Sigma_hat, so any direct numerical check is a
  HEURISTIC implementation proxy, not the paper's assumption itself.
  Rewrote the Prop 2 Implementation-use cell to (a) name Sigma as the
  model covariance, (b) acknowledge Sigma_hat as the implementation
  surrogate, (c) require any sharper warning based on Sigma_hat be
  labeled heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
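
The covariance-source split documented above can be contrasted in a few lines. Function names and the toy matrix are schematic, not the shipped adapter code in `diff_diff/pretrends.py`.

```python
# Illustrative contrast between the two covariance sources described above:
# full pre-period sub-VCV extraction (MultiPeriodDiD branch) vs the
# diag(ses^2) fallback (CS/SA branches), which drops the pre-period
# correlations Roth's Propositions rely on and is not guaranteed conservative.
import numpy as np


def pre_period_vcov_full(vcov: np.ndarray, pre_idx: list[int]) -> np.ndarray:
    """Extract the full pre-period block Sigma_22, keeping off-diagonals."""
    idx = np.asarray(pre_idx)
    return vcov[np.ix_(idx, idx)]


def pre_period_vcov_diag(ses: np.ndarray) -> np.ndarray:
    """diag(ses^2) fallback: off-diagonals zeroed."""
    return np.diag(np.asarray(ses) ** 2)


vcov = np.array([[0.04, 0.01, 0.00],
                 [0.01, 0.04, 0.01],
                 [0.00, 0.01, 0.09]])
full = pre_period_vcov_full(vcov, [0, 1])
diag = pre_period_vcov_diag(np.sqrt(np.diag(vcov))[:2])
print(full)  # off-diagonal 0.01 retained
print(diag)  # off-diagonal zeroed; diagonals agree with `full`
```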
- P1 "conservative variance source" framing for diag fallback is wrong:
  R9 fix added a Note saying the CS/SA `diag(ses^2)` fallback could be
  retained as a "deliberate variance-conservatism choice". Roth's
  Propositions 1+3+4 (and the MDV/power computation w' Σ_22^{-1} w)
  operate on the full pre-period covariance block Σ_22 — dropping the
  off-diagonals is a non-paper approximation, not a paper-supported
  numerical option, and the direction of the discrepancy with the full-
  Σ_22 calc depends on the sign and magnitude of the dropped
  correlations (not guaranteed conservative). Reworded the Note to
  describe `diag(ses^2)` explicitly as a non-paper approximation; if the
  follow-up audit retains it, an explicit REGISTRY.md Note describing
  the approximation + its possible miscalibration is required instead.
- P3 R package version not pinned: R `pretrends` package API claims in
  the Joint-Wald and version-pin Gaps bullets were temporally unstable.
  Softened the Joint-Wald observation to "as observed at the time of
  this review (specific commit not pinned)" with an explicit ask for the
  follow-up audit to record the exact revision. Strengthened the
  version-pin Gaps bullet to enumerate exactly which surface claims
  depend on the unpinned R-package surface (NIS-only, no joint-Wald,
  `pretrends()` / `slope_for_power()` / `*_NIS` helpers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R11 verdict was ✅ Looks good (second consecutive) — addressing remaining
polish items.

- P2 CS/SA covariance-source distinction: R9/R10 Note lumped CS and SA
  together as "even when an event-study VCV is available". That is true
  for CallawaySantAnnaResults (which persists event_study_vcov at
  staggered_results.py:126-128) but NOT for SunAbrahamResults (which
  does not expose an event-study or cohort covariance surface; verified
  in sun_abraham.py:30-88). Split the Note into per-result-type bullets:
  CS diag fallback is a deliberate choice (off-diagonals available);
  SA diag fallback is forced (off-diagonals not persisted on the result
  object, so Roth-faithful off-diagonal support requires upstream surface
  work first).
- P3 unsourced K<=5 / K>10 cutoffs: the "For K <= 5, analytical methods
  are fast. For K > 10, simulation is preferable" guidance in
  Computational Considerations was a hunch, not benchmarked or
  paper-derived. Relabeled as "tentative heuristic (not benchmarked in
  this review and not specified by the paper)" with explicit ask for
  the follow-up audit to benchmark or replace.
- P3 R-package version pin still implicit: R10 fix softened the Joint
  Wald observation to "as observed at the time of this review" but the
  reviewer wanted a more explicit catch-all. Strengthened the Gaps
  bullet to enumerate every R-package surface claim that needs
  re-verification at a pinned revision (NIS-based surface, no joint-
  Wald target, named helpers, nonlinear-trend support claim) and label
  them collectively provisional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R12 verdict was ✅ Looks good (third consecutive) — addressing remaining
polish items.

- P2 "uninformative pretests are unambiguously harmful" overstates II.D:
  Roth's Section II.D + Equation (4) say the net publication-screening
  effect is ambiguous, with underpowered pretests being "least effective
  and potentially harmful" — not unambiguously harmful in every
  parameterization. Internally inconsistent with my own L266 summary
  ("net sign depends on which dominates"). Reworded the Publication-bias
  edge-case bullet to match the paper's own framing.
- P3 "adding more pre-periods does NOT help (and can hurt)" overstates
  Section I.D: paper only says additional distant pre-periods may not
  help / may be uninformative for near-treatment shocks. Softened to
  "may not help / may be uninformative" without the active-harm claim.
- P3 gamma_max = "largest |gamma| at which power = 1" is impossible:
  under the normal model power approaches 1 only asymptotically, so
  no finite gamma_max satisfies power(gamma_max) = 1 exactly. Replaced
  with a doubling-expansion + bisection bracketing strategy.
- P3 audit-notes vs registry-candidate boundary unclear: added an
  explicit HTML-comment boundary marker just before "## Implementation
  Notes", retitled that heading to "(audit notes — NOT registry-
  candidate)", and updated the Status banner to call out the boundary
  explicitly so tentative heuristics + provisional R-package claims
  cannot be misread as part of the proposed REGISTRY replacement text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
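
The doubling-expansion + bisection strategy above can be sketched generically. The helper name is invented, and the sketch assumes `power` is monotone increasing in gamma (which the bracketing argument requires); the toy power curve is for illustration only.

```python
# Sketch of the doubling-expansion + bisection bracketing described above:
# since power -> 1 only asymptotically, search for the smallest gamma with
# power(gamma) >= target rather than for power(gamma) == 1 exactly.
import math


def gamma_for_target_power(power, target=0.80, tol=1e-8):
    # Doubling expansion: grow the bracket until power(hi) clears the target.
    lo, hi = 0.0, 1.0
    while power(hi) < target:
        lo, hi = hi, 2.0 * hi
    # Bisection inside the bracket [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if power(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Toy monotone power curve for illustration only.
toy_power = lambda g: 1.0 - math.exp(-g)
g80 = gamma_for_target_power(toy_power, target=0.80)
print(round(g80, 4))  # ~= -ln(0.2) ~= 1.6094
```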
…ates

R13 verdict was ✅ Looks good (4th consecutive) — addressing remaining
polish items.

- P2 Wald vs NIS power-object conflation: the CS/SA covariance-source
  Note said "the key MDV/power quantity is w' Σ_22^{-1} w". That scalar
  is the current library's Wald / noncentral-χ² object, not Roth's
  paper-analyzed NIS power object. Roth's NIS power is the multivariate
  box probability P(β̂_pre ∈ B_NIS(Σ)) computed under
  β̂_pre ~ N(δ_pre, Σ_22), and Proposition 1's pretest-bias term + Proposition
  3's conditional-variance term rely on truncated-MVN moments. Reworded
  the Note to separate the two surfaces: Roth's NIS box probability and
  the current library's Wald form are BOTH non-conservative under the
  diag fallback, but for different reasons, and the audit needs to pick
  one of them as the surface to promote.
- P3 scipy.stats over-restriction: replaced "scipy.stats has only the
  univariate case" with the accurate distinction —
  `scipy.stats.multivariate_normal.cdf` does cover box probabilities,
  but SciPy lacks a `tmvtnorm`-equivalent for the truncated-MVN MOMENTS
  that Propositions 1 + 3 need.
- P3 discoverability links (sibling updates, follows Bacon precedent at
  REGISTRY.md:2611):
  * REGISTRY.md `## PreTrendsPower` entry now includes
    `Paper review on file: docs/methodology/papers/roth-2022-review.md`
    inline with the primary source citation, with explicit note that
    REGISTRY remains authoritative until the follow-up audit lands.
  * METHODOLOGY_REVIEW.md PreTrendsPower section's "Documentation in
    place" list now records the paper review with the 2026-05-17 date;
    "Outstanding for promotion" has the paper-review bullet removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
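
The multivariate box probability distinguished above can be sketched via inclusion-exclusion over box corners (2-D case for brevity; the function name and toy inputs are invented for this example).

```python
# Sketch of the NIS box probability described above:
# P(beta_hat_pre in a box) with beta_hat_pre ~ N(delta_pre, Sigma_22),
# computed by inclusion-exclusion over corners using
# scipy.stats.multivariate_normal.cdf (which covers box probabilities,
# unlike the truncated-MVN *moments* that Propositions 1 + 3 need).
import numpy as np
from scipy.stats import multivariate_normal


def box_probability_2d(mean, cov, lower, upper):
    """P(lower_i < X_i < upper_i for i = 1, 2) for X ~ N(mean, cov)."""
    F = lambda x1, x2: multivariate_normal.cdf([x1, x2], mean=mean, cov=cov)
    return (F(upper[0], upper[1]) - F(lower[0], upper[1])
            - F(upper[0], lower[1]) + F(lower[0], lower[1]))


# Independent standard normals: P(|X_i| < 1.96)^2 ~= 0.95**2 ~= 0.9025
p = box_probability_2d([0.0, 0.0], np.eye(2), [-1.96, -1.96], [1.96, 1.96])
print(p)
```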
R13's REGISTRY pointer pulled the new audit's diag-fallback finding into
authoritative scope; R14 flagged that the authoritative `## PreTrendsPower`
REGISTRY entry didn't yet document the deviation. Adding the inline
documentation per the CLAUDE.md "Documenting Deviations" convention.

- P1 REGISTRY missing `**Note (deviation from paper):**` for diag-VCV
  fallback: added a Note inside the `## PreTrendsPower` Edge cases block
  documenting all three result-type paths (MultiPeriodDiD: full sub-VCV
  when interaction_indices populated, diag otherwise; CS: diag fallback
  even with event_study_vcov available — deliberate; SA: diag fallback
  forced — SunAbrahamResults has no event-study VCV surface). The Note
  cross-references the paper review file and the new TODO entries.
- P2 `compute_pretrends_power(..., violation_type="custom")` is broken
  today: added an explicit "**Helper/class API gap observed today**"
  call-out in the Status block of the paper review, naming the missing
  `violation_weights` parameter on the helper and clarifying that
  `"custom"` is class-only today.
- P3 R-package version pin + diag-fallback follow-up: added three TODO.md
  rows under "Methodology/Correctness":
  * Diag fallback follow-up (Medium) — route CS through event_study_vcov +
    extend SunAbrahamResults with event-study/cohort VCV, or formally
    retain diag with miscalibration framing.
  * R pretrends commit pin (Low) — record the audited revision before
    building parity fixture.
  * compute_pretrends_power "custom" helper gap (Low) — add
    violation_weights to helper signature or document helper as supporting
    only linear/constant/last_period.

All three TODO rows tagged "PR-A (Roth paper review, 2026-05-17)" for
provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R15 verdict was ✅ Looks good (5th overall ✅) — addressing remaining polish.

- P3 CS event_study_vcov availability stated too broadly: previous wording
  in REGISTRY.md, roth-2022-review.md, and TODO.md said
  CallawaySantAnnaResults "persists event_study_vcov", but bootstrap CS
  fits explicitly clear that matrix at staggered.py:2032-2036 (to prevent
  mixing analytical VCV with bootstrap SEs). Qualified all three surfaces
  to split non-bootstrap (persists event_study_vcov, diag fallback is a
  deliberate choice) vs bootstrap (clears event_study_vcov, diag fallback
  is the only path) CS paths.
- P3 "four power calculations" parity target underspecified: the
  PreTrendsPower row in METHODOLOGY_REVIEW.md said future R parity should
  cover "the four power calculations" without acknowledging that
  `compute_pretrends_power(..., violation_type="custom")` is unusable from
  the helper today (helper does not accept violation_weights). Added
  clarification that "custom" parity has to run through PreTrendsPower
  directly until the helper is extended; helper-only parity is limited to
  linear/constant/last_period.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R16 verdict was ✅ Looks good (6th overall ✅), with all findings classified
P3-informational. Two of the four findings ("None in this PR" / "TODO rows
are correct") required no action; addressing the two with concrete polish
suggestions.

- P3 helper docstring: `compute_pretrends_power` and `compute_mdv` both
  accept `violation_type` but don't accept `violation_weights`, so
  `violation_type="custom"` is unusable from either helper today. Added
  an explicit Note in the `violation_type` docstring entry of both
  convenience functions pointing users to instantiate `PreTrendsPower`
  directly for custom weights, and cross-referencing the TODO row that
  tracks the helper-extension follow-up.
- P3 METHODOLOGY_REVIEW.md "four power calculations" target: now reads
  "at a pinned revision" with cross-reference to the TODO row that
  tracks the R-package revision pin. Until that lands, the R-package
  surface claims in the paper review remain provisional.

Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, ndarray-to-tuple coercion) are pre-existing in code I did not
touch; my edits are docstring-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R17 escalated to ⛔ Blocker after the agentic codex reviewer dug into
pretrends.py and surfaced two pre-existing library bugs that my R16
helper docstring update inadvertently directed users toward:

- P0 verified: PreTrendsPowerResults.power_at() silently wrong for
  violation_type="custom". The result dataclass does not persist fitted
  violation_weights (pretrends.py:77-90), and power_at() falls back to
  equal weights with an inline "we can't reconstruct" comment at line
  230-231. My R16 docstring change pointed users to the broken class
  path.
- P1 verified: linear violation pattern at _get_violation_weights()
  (pretrends.py:510-515) ignores actual pre-period relative-time labels
  and constructs a shifted [n-1,...,0] direction from n_pre count alone,
  so for irregular / anticipation-shifted grids the reported MDV is
  NOT in Roth's γ units.

Per user direction, keeping PR-A paper-review-scoped and deferring both
bugs to PR-B (the implementation audit):
- Added two High-priority TODO.md rows tagged "PR-A (surfaced by R17 of
  the iterative codex review)" with explicit line refs + concrete fix
  outlines (persist normalized fitted weights on PreTrendsPowerResults
  for the P0; thread relative-time labels through _get_violation_weights
  for the P1).
- Reverted the R16 helper-docstring recommendation that pointed custom
  users to instantiate PreTrendsPower(..., violation_weights=...). The
  helper now describes the helper-vs-class API gap AND the
  PreTrendsPowerResults.power_at() known issue, without recommending
  the broken class path. Both gaps cross-referenced to TODO.md.

Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, etc.) are pre-existing in code I did not touch; my edits are
docstring + TODO.md only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ar-deviation Note

R18 rejected the R17 TODO-only deferral: "a TODO/docstring note does not
mitigate a silent statistical bug." Per user direction, applying the
minimal partial fix that codex's verdict explicitly offered as acceptable.

- P0 silent-failure guard on PreTrendsPowerResults.power_at: rather than
  doing the full weight-persistence refactor (deferred to PR-B), the
  method now raises NotImplementedError for violation_type="custom" and
  points users to refit with the new M instead. The previous equal-weights
  fallback (with the literal "we can't reconstruct" comment) is replaced
  with a fail-loud guard at the top of power_at; the inner else branch is
  retained as a defensive fallback for future violation_type values.
  Added a Raises section to the docstring documenting the guard. Added
  a regression test
  `test_power_at_raises_on_custom_violation_type` to lock the behavior in;
  full test_pretrends.py suite passes 64/64.
- P1 REGISTRY linear-violation deviation Note: added a `**Note (deviation
  from paper — linear violation pattern):**` block inside the
  `## PreTrendsPower` entry's Violation-types block describing how the
  shipped `[n_pre-1, ..., 1, 0]` direction is built from `n_pre` count
  alone (no relative-time labels), so irregular / anticipation-shifted
  grids do not yield Roth's γ-units MDV. Cross-referenced to TODO.md.
- Silent-failure-guard Note: also added a paired Note documenting the
  power_at custom raise behavior in REGISTRY.md for symmetry with the
  linear deviation Note (both label the corresponding TODO follow-up).
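
As a rough sketch of the fail-loud guard pattern described above (the class, the placeholder power curve, and the messages are simplified stand-ins, not the library's actual implementation):

```python
import math

class ResultsSketch:
    """Simplified stand-in for PreTrendsPowerResults; illustration only."""

    def __init__(self, violation_type: str):
        self.violation_type = violation_type

    def power_at(self, M: float) -> float:
        # Fail-loud guard at the top of the method: fitted custom weights
        # are not persisted on the result object, so they cannot be
        # reconstructed here; raise rather than silently substitute
        # equal weights.
        if self.violation_type == "custom":
            raise NotImplementedError(
                "power_at() cannot reconstruct custom violation_weights; "
                "refit the estimator with the new M instead."
            )
        # Placeholder curve standing in for the real power calculation.
        return 1.0 - math.exp(-M)

print(round(ResultsSketch("linear").power_at(1.0), 3))  # 0.632
try:
    ResultsSketch("custom").power_at(1.0)
except NotImplementedError:
    print("custom fit fails loudly")
```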

Pre-existing pyright diagnostics in pretrends.py are unrelated to this
diff (my edits are confined to power_at, docstrings, REGISTRY, and a
new test method).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R19 verdict was ✅ Looks good — addressing all 3 cleanup items so the
documentation, tracking, and tests align with R18's actual mitigation.

- P2 helper docstrings stale: the `compute_pretrends_power` and
  `compute_mdv` docstrings still said `power_at()` "silently falls back
  to equal weights" even though R18 changed the behavior to raise
  `NotImplementedError`. Updated both docstrings to describe the actual
  current behavior: the helper raises `ValueError` from the underlying
  constructor for `violation_type='custom'`, and `PreTrendsPowerResults.
  power_at()` on a custom class fit raises `NotImplementedError` because
  fitted weights are not yet persisted.
- P3 TODO.md row stale wording: `TODO.md:100` row still described the
  silent-failure path in present tense. Rewrote to record that the
  silent-failure path was mitigated in R18 (locked by
  test_power_at_raises_on_custom_violation_type) and narrowed the
  remaining follow-up to the weight-persistence work needed to re-enable
  `power_at()` for custom fits. Downgraded priority from High to Medium
  since the silent-bug surface is closed.
- P3 helper-level regression test: added two new tests to
  `TestConvenienceFunctions`:
  * `test_compute_pretrends_power_rejects_custom_violation_type` —
    confirms ValueError when `compute_pretrends_power(...,
    violation_type='custom')` is called without weights.
  * `test_compute_mdv_rejects_custom_violation_type` — mirrors the
    contract for `compute_mdv`.
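
The contract those two tests lock in can be sketched with a stub in place of the real helper (the stub's name, signature, and error message are assumptions for illustration; the actual tests call the library helpers):

```python
def compute_pretrends_power_stub(violation_type: str = "linear"):
    """Stub standing in for the convenience helper: it exposes no
    violation_weights parameter, so a 'custom' request cannot be
    satisfied and the underlying constructor's ValueError propagates."""
    if violation_type == "custom":
        raise ValueError(
            "violation_type='custom' requires violation_weights"
        )
    return "power-result"

def test_rejects_custom_violation_type():
    # Mirrors the shape of the new helper-level regression tests.
    try:
        compute_pretrends_power_stub(violation_type="custom")
    except ValueError as exc:
        assert "custom" in str(exc)
    else:
        raise AssertionError("expected ValueError for violation_type='custom'")

test_rejects_custom_violation_type()
print("contract test passed")
```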

Full test_pretrends.py suite: 66/66 passing (64 existing + 2 new helper tests).
Pre-existing pyright diagnostics in pretrends.py + test_pretrends.py are
unrelated to this diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allback

R20 verdict was ✅ Looks good. The two methodology + tech-debt
P3-informational findings explicitly stated "none required for approval"
(both already tracked via REGISTRY notes + TODO rows). The one actionable
finding was addressed with a concrete fix:

- P3 maintainability — PreTrendsPowerResults.power_at() still had an
  unknown-violation_type else fallback that defaulted to equal weights
  (np.ones(n_pre)). A future violation_type added to fit() but not
  threaded through power_at() would therefore silently have produced
  equal-weights output. Replaced the fallback with an explicit ValueError
  mirroring the raise at the end of _get_violation_weights() (lines
  530-531). This aligns the two weight-reconstruction paths and closes
  the future-drift silent surface.
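
A minimal sketch of the hardened branch (the helper name is mine, and the violation-type strings other than "linear" and "custom" are assumptions; only the shape of the before/after behavior is taken from this thread):

```python
import numpy as np

def reconstruct_weights_sketch(violation_type: str, n_pre: int) -> np.ndarray:
    """Illustration of the hardened else branch, not the library's code."""
    if violation_type == "linear":
        return np.arange(n_pre - 1, -1, -1, dtype=float)
    if violation_type == "equal":  # assumed built-in type, for illustration
        return np.ones(n_pre)
    # Before the fix this branch was `return np.ones(n_pre)`, so an
    # unrecognized type silently produced equal-weights output. After the
    # fix it fails loudly, mirroring _get_violation_weights().
    raise ValueError(f"Unknown violation_type: {violation_type!r}")

print(reconstruct_weights_sketch("linear", 3))  # [2. 1. 0.]
try:
    reconstruct_weights_sketch("quadratic", 3)
except ValueError:
    print("unknown type fails loudly")
```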

Test suite: 66/66 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • Severity P3-informational. Impact: The PR removes a silent wrong-answer path in PreTrendsPowerResults.power_at() for custom fits and replaces it with a documented loud failure. This matches the new REGISTRY.md note, so it is a mitigation, not a new methodology defect. Concrete fix: none required for approval; the tracked follow-up is to persist fitted violation_weights on the results object and re-enable power_at() for custom fits. diff_diff/pretrends.py:L196-L259, docs/methodology/REGISTRY.md:L2798-L2798, TODO.md:L100-L100
  • Severity P3-informational. Impact: The PR correctly records two existing deviations from Roth (2022): the count-based linear direction is not in Roth’s γ units on irregular grids, and some adapters still use diagonal Σ_22 fallback. Because both deviations are explicitly documented in REGISTRY.md, they are not blockers under the review policy. Concrete fix: none required for approval; the follow-up work is already tracked. docs/methodology/REGISTRY.md:L2796-L2814, TODO.md:L97-L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3-informational. Impact: PreTrendsPowerResults.power_at() still re-implements weight reconstruction instead of consuming persisted fitted weights, so future violation-type additions still require edits in more than one place. The new ValueError prevents silent corruption, but the drift risk remains. Concrete fix: persist normalized fitted weights on PreTrendsPowerResults and compute power_at() from the stored vector. This is already tracked in TODO.md. diff_diff/pretrends.py:L239-L264, diff_diff/pretrends.py:L516-L559, TODO.md:L100-L100

Tech Debt

  • No new untracked tech debt. The remaining PreTrendsPower follow-ups are explicitly captured in TODO.md:L97-L101.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: REPORTING.md still describes the diagonal-covariance fallback as conservative, but the new authoritative registry/TODO language says dropping off-diagonals is not provably conservative. That leaves two methodology docs disagreeing about the same behavior. Concrete fix: align docs/methodology/REPORTING.md with the new REGISTRY.md note and remove the conservative framing unless it can be proven. docs/methodology/REPORTING.md:L324-L337, docs/methodology/REGISTRY.md:L2809-L2814
  • Severity P3. Impact: The new paper-review file says the reviewed PDF was papers/roth-2022.pdf, but there is no such PDF in the repo tree, so the audit is not fully reproducible from the repository alone. Concrete fix: add the PDF at the documented path, update the path to the real local location, or state that the PDF was reviewed externally and keep only the DOI. docs/methodology/papers/roth-2022-review.md:L3-L7
  • The added regression coverage is appropriate for the changed code paths in tests/test_pretrends.py:L527-L609. I could not execute the tests in this environment because pytest is not installed.

CI codex Round 1 verdict was ✅ Looks good. Addressing the two actionable
P3 doc findings; the other 4 P3 findings were explicit "no fix required"
mitigation confirmations.

- P3 REPORTING.md parity: the diagonal-covariance-fallback Note still
  called the deviation "conservative" at REPORTING.md:324-337, but the
  new authoritative REGISTRY.md note (in this PR) describes it as a
  non-paper approximation that is NOT provably conservative (direction
  of the discrepancy with full Σ_22 depends on sign + magnitude of
  dropped correlations). Aligned the REPORTING note to match the
  REGISTRY language, kept the description of the BusinessReport
  well_powered → moderately_powered downgrade as a practical safeguard
  (not a proof of conservatism), and cross-referenced the REGISTRY
  authoritative deviation block.
- P3 paper-review reproducibility: the roth-2022-review.md header
  listed "PDF reviewed: papers/roth-2022.pdf" but the /papers/ directory
  is gitignored, so the path isn't reproducible from the repo alone.
  Replaced the path with the published-article citation + DOI + an
  explicit note that the PDF was reviewed externally and that the
  /papers/ working directory is gitignored. Added pointers to the
  jonathandroth.com author page for reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b3f818100a13bc48d96044ac4bd69a8027fce4aa


Overall Assessment

✅ Looks good

Executive Summary

  • The only runtime behavior change in scope is PreTrendsPowerResults.power_at() for violation_type="custom", and it replaces a silent equal-weights fallback with a loud NotImplementedError. diff_diff/pretrends.py:L196-L279
  • That change matches the authoritative methodology note now added to REGISTRY.md, so I do not see an unmitigated P0/P1 in the changed estimator path. docs/methodology/REGISTRY.md:L2796-L2814
  • The helper/class split for custom violation patterns is now explicit in both docstrings and regression tests, rather than being an implicit failure mode. diff_diff/pretrends.py:L1076-L1183, tests/test_pretrends.py:L527-L609
  • The prior doc drift from the last review is resolved: REPORTING.md no longer calls the diagonal-VCV fallback conservative, and the new paper review clarifies that the reviewed PDF was external rather than expected in-repo. docs/methodology/REPORTING.md:L324-L344, docs/methodology/papers/roth-2022-review.md:L1-L15

Methodology

  • Severity P3-informational. Impact: PreTrendsPowerResults.power_at() now fails loudly for custom fits because the result object does not persist fitted violation_weights, eliminating the prior silent wrong-output path. This behavior is explicitly documented in the authoritative registry, so it is mitigated rather than a new methodology defect. Concrete fix: none required for approval; the follow-up to persist fitted weights is already tracked. diff_diff/pretrends.py:L196-L279, docs/methodology/REGISTRY.md:L2798-L2814, TODO.md:L100-L100
  • Severity P3-informational. Impact: The remaining Roth-2022 mismatches surfaced by the audit, count-based linear weights on irregular grids and diagonal Σ_22 fallback on CS/SA adapters, are now documented with note labels in REGISTRY.md and tracked in TODO.md. Under the stated rubric, that makes them mitigated, not P1 methodology defects. Concrete fix: none required for approval; follow-ups already tracked. docs/methodology/REGISTRY.md:L2796-L2814, TODO.md:L97-L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3-informational. Impact: PreTrendsPowerResults still reconstructs built-in weight vectors from violation_type and cannot reuse fitted custom weights, so results.power_at(M) remains unavailable for custom fits. The PR correctly converts that from silent corruption into explicit API behavior, but the duplication remains. Concrete fix: persist normalized fitted violation_weights on the results object and have power_at() consume the stored vector. diff_diff/pretrends.py:L77-L90, diff_diff/pretrends.py:L196-L279, TODO.md:L100-L100

Tech Debt

  • Severity P3-informational. Impact: The PR properly records the remaining follow-ups in TODO.md rather than leaving them implicit: Σ_22 fidelity, pinned R revision, helper custom support, custom-weight persistence, and linear-units alignment. There is no new untracked tech debt in scope. Concrete fix: none required for approval. TODO.md:L97-L101

Security

  • No findings.

Documentation/Tests

  • No findings. The changed docs now agree with each other on the diagonal-VCV fallback, and the paper-review file now makes its source provenance explicit. The new regression coverage targets all newly guarded custom entry points. docs/methodology/REPORTING.md:L324-L344, docs/methodology/papers/roth-2022-review.md:L1-L15, tests/test_pretrends.py:L527-L609
  • I could not execute the tests in this environment because pytest is not installed.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 17, 2026
@igerber igerber merged commit f6499db into main May 17, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/roth-2022-paper-review branch May 17, 2026 16:55